Abstract



Quantum field theories underlie all of our understanding of the fundamental forces of
nature. There are relatively few first-principles approaches to the study of quantum field
theories [such as quantum chromodynamics (QCD), relevant to the strong interaction] away
from the perturbative (i.e., weak-coupling) regime. Currently the most common approach is
the use of Monte Carlo methods on a hypercubic space-time lattice. These methods consume
enormous computing power for large lattices, and it is essential that increasingly
efficient algorithms be developed to perform standard tasks in these lattice calculations.
Here we present a general algorithm for QCD that allows one to put any planar improved
gluonic lattice action onto a parallel computing architecture. High-performance masks for
specific actions (including non-planar actions) are also presented. These algorithms have
been successfully employed by us in a variety of lattice QCD calculations using improved
lattice actions on a 128-node Thinking Machines CM-5.

Keywords: quantum field theory; quantum chromodynamics; improved
actions; parallel computing algorithms.

I. INTRODUCTION



It is almost universally accepted that Quantum Chromodynamics (QCD) is the underlying
quantum field theory of the strong interaction, which binds atomic nuclei and fuels
the sun and the stars. Strongly interacting particles are referred to as hadrons, which include,
for example, the protons and neutrons that make up atomic nuclei as well as a wide variety of
particles that are produced in particle accelerators and by astrophysical sources. These
hadrons are made up of quarks and gluons, which are the underlying constituents in QCD.
The quarks are spin-1/2 particles (i.e., fermions) and the gluons are massless spin-1 particles
(i.e., gauge bosons). The quarks carry a "colour" charge and interact strongly through the
exchange of gluons. The 8 gluons of SU(3) (i.e., one for each generator of SU(3)) themselves
carry colour and hence interact with themselves as well as with the quarks. This is the
essential difference between QCD and the corresponding theory of photons and electrons,
referred to as quantum electrodynamics (QED), and it has far-reaching consequences: the
two theories behave entirely differently.

There are very few first-principles methods for studying QCD in the nonperturbative
low-energy regime. The most widely used of these is the so-called Lagrangian-based lattice field
theory, which formulates the field theory on a space-time lattice. An alternative lattice
approach is based on the Hamiltonian formulation of quantum field theory and makes use
of cluster decompositions and, again, Monte Carlo methods to carry out the simulations.
In addition, there are numerous studies based on a light-front formulation of QCD, and
much use has been made of Schwinger-Dyson equations to assist with the construction
of QCD-based quark models.

The Lagrangian-based lattice technique simulates the functional integral using a
four-dimensional hypercubic Euclidean space-time lattice together with Monte Carlo methods
for generating an ensemble of gluon field configurations with the appropriate Boltzmann
distribution $\exp(-S_G)$, where $S_G$ is a discretized form of the QCD gluon action on the
hypercubic lattice. The simplest discretizations of the QCD action involve only nearest
neighbours on the lattice and have $\mathcal{O}(a^2)$ errors, where $a$ is the lattice spacing. Improved
actions represent a major advance for the field of lattice gauge theory, where by using 
increasingly non-local discretizations of the QCD action we can obtain the same accuracy 
with far fewer lattice points and hence far less computational time and effort. The purpose 
of the present work is to describe an algorithm which allows us to implement an arbitrarily 
improved (i.e., arbitrarily non-local) action in an efficient way. For further details on the 
state of the art lattice QCD techniques see for example Ref. [g]. Another related and 
equally important advance is the technique of nonperturbative improvement (e.g., mean-field
improvement), which corrects for some of the major nonperturbative effects (the so-called
tadpole contributions) and hence brings the lattice results to their continuum form more
quickly by improving the matching with perturbation theory at a given finite lattice spacing $a$.
It is the combination of improved actions and nonperturbative improvement that
has come to represent a significant advance for the field.

Lattice QCD is based on a Monte Carlo treatment of the path integral formulation, 
which makes it a computationally demanding method for calculating physical observables. 
The gluon field is represented by $3 \times 3$ complex SU(3) matrices, with one such
SU(3) matrix associated with every link on the lattice. The links lie only along one of the
four Cartesian directions and join neighbouring lattice sites. Since all lattice links require
identical numerical calculations, lattice gauge theory is ideally suited to parallel computers.

There are various types of improved actions and, as explained above, these are all based 
on the idea of eliminating the discretization errors that occur when passing from continuum 
physics to the discretized lattice version. The simplest (i.e., non-improved) gluon action 
is the so-called standard Wilson action and consists of $1 \times 1$ Wilson loops or, as they are
frequently called, plaquettes. We shall often refer to the Wilson loops used to build up lattice
actions as plaquettes. The need to build the gluon action out of closed loops arises from
the need to maintain exact SU(3) gauge invariance in the discrete lattice action. This $1 \times 1$
loop action was first proposed by Wilson in the early 1970s and has been used extensively
over the years. It consists of taking an arbitrary starting site, say $x$, on the lattice and
stepping around a $1 \times 1$ loop until returning to the starting point $x$. The $1 \times 1$ Wilson loop
is illustrated in Fig. 1.

FIG. 1. The $1 \times 1$ plaquette $U_{\rm sq}(x)$ with base at $x$ lying in the $\mu$-$\nu$ plane. The lattice spacing is
denoted by $a$.



Improving the standard Wilson action is achieved by making use of larger loops (e.g.,
$1 \times 2$, $2 \times 2$, etc.) in the lattice gluon action to eliminate finite lattice spacing artifacts
to a given order in $a$. For an elegant and detailed discussion of these topics see Ref.

In this article we present an efficient and completely general algorithm that permits one
to calculate any improved planar lattice action at any desired level of improvement. By
"planar" here we mean that we will consider actions containing two-dimensional loops of
arbitrary size which lie in any of the Cartesian planes (i.e., the $x$-$y$, $x$-$z$, $x$-$t$, $y$-$z$, $y$-$t$,
or $z$-$t$ plane). This algorithm has been used in a wide variety of improved action lattice
simulations to date. For example, it has been used in studies of the topological structure
of the QCD vacuum and the calibration of the various cooling and smearing techniques,
the study of discretization errors in the Landau gauge on the lattice, and studies of the
static quark potential [15]. It is currently being used in studies of the gluon propagator
and highly improved actions.

In Sec. II, we briefly describe the tree-level improved action that we have been using
in our calculations [13]-[17] with our algorithm on a Thinking Machines CM-5. Sec. III
gives two possible ways of using the technique for the standard Wilson action. The form of
the algorithm appropriate for the first level of improvement (i.e., involving a combination
of the elementary square $1 \times 1$ plaquette and the rectangular $1 \times 2$ plaquette) is given in
Sec. IV. Then in Sec. V we present the general algorithm suitable for an arbitrarily
nonlocal action (i.e., for an $n \times m$ Wilson loop with $n$ and $m$ being arbitrary positive integers).
Sec. VI addresses non-planar issues encountered in nesting specific planar actions. Actions
involving non-planar loops are also addressed. Finally, in Sec. VII we present our summary
and conclusions.



II. GAUGE ACTION, MASKING, AND PARALLEL COMPUTING 
A. Lattice Gauge Action for Colour SU(3)

The standard Wilson action for the gluons is given by 



$$ S_G = \beta \sum_{\rm sq} \left[ 1 - \frac{1}{3}\, \mathcal{R}e\, {\rm Tr}\, U_{\rm sq}(x) \right] \tag{1} $$



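and a simple tree-level $\mathcal{O}(a^2)$-improved action (i.e., the action with the first level of
improvement) is defined as

$$ S_G = \frac{5\beta}{3} \sum_{\rm sq} \frac{1}{3}\, \mathcal{R}e\, {\rm Tr} \left( 1 - U_{\rm sq}(x) \right) - \frac{\beta}{12 u_0^2} \sum_{\rm rect} \frac{1}{3}\, \mathcal{R}e\, {\rm Tr} \left( 1 - U_{\rm rect}(x) \right) . \tag{2} $$

The $1 \times 1$ square (or plaquette) $U_{\rm sq}(x)$ and the $1 \times 2$ rectangle $U_{\rm rect}(x)$ are defined by

$$ U_{\rm sq}(x) = U_\mu(x)\, U_\nu(x+\hat\mu)\, U^\dagger_\mu(x+\hat\nu)\, U^\dagger_\nu(x) \tag{3} $$

$$ U_{\rm rect}(x) = U_\mu(x)\, U_\nu(x+\hat\mu)\, U_\nu(x+\hat\nu+\hat\mu)\, U^\dagger_\mu(x+2\hat\nu)\, U^\dagger_\nu(x+\hat\nu)\, U^\dagger_\nu(x) $$
$$ \qquad\quad + \; U_\mu(x)\, U_\mu(x+\hat\mu)\, U_\nu(x+2\hat\mu)\, U^\dagger_\mu(x+\hat\mu+\hat\nu)\, U^\dagger_\mu(x+\hat\nu)\, U^\dagger_\nu(x) . \tag{4} $$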
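The link products of Eqs. (3) and (4) can be illustrated with a minimal sketch in plain Python. This is not the CM-Fortran production code; the dictionary layout `links[(site, mu)]`, the helper names, and the encoding of a loop as a list of signed steps are all assumptions made for illustration only.

```python
import itertools

N_DIM = 4   # Cartesian directions: 0=x, 1=y, 2=z, 3=t
N_COL = 3   # number of colours

def identity():
    return [[complex(i == j) for j in range(N_COL)] for i in range(N_COL)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(N_COL))
             for j in range(N_COL)] for i in range(N_COL)]

def dagger(a):
    return [[a[j][i].conjugate() for j in range(N_COL)] for i in range(N_COL)]

def re_trace(a):
    return sum(a[i][i] for i in range(N_COL)).real

def step(site, mu, n, dims):
    """Site reached from `site` after n steps along direction mu (periodic)."""
    moved = list(site)
    moved[mu] = (moved[mu] + n) % dims[mu]
    return tuple(moved)

def path_product(links, site, path, dims):
    """Ordered product of links along a path of (mu, +1) / (mu, -1) steps."""
    prod = identity()
    for mu, sign in path:
        if sign > 0:
            prod = matmul(prod, links[(site, mu)])          # U_mu(site)
            site = step(site, mu, 1, dims)
        else:
            site = step(site, mu, -1, dims)
            prod = matmul(prod, dagger(links[(site, mu)]))  # U_mu(site)^dagger
    return prod

def u_sq(links, site, mu, nu, dims):
    """1x1 plaquette, following the link ordering of Eq. (3)."""
    return path_product(links, site, [(mu, 1), (nu, 1), (mu, -1), (nu, -1)], dims)

def u_rect_terms(links, site, mu, nu, dims):
    """The two 1x2 rectangle terms of Eq. (4), returned separately."""
    tall = [(mu, 1), (nu, 1), (nu, 1), (mu, -1), (nu, -1), (nu, -1)]
    wide = [(mu, 1), (mu, 1), (nu, 1), (mu, -1), (mu, -1), (nu, -1)]
    return (path_product(links, site, tall, dims),
            path_product(links, site, wide, dims))

def cold_start(dims):
    """All links set to the identity matrix."""
    all_sites = itertools.product(*(range(length) for length in dims))
    return {(s, mu): identity() for s in all_sites for mu in range(N_DIM)}
```

On a cold start every loop is the identity, so the real part of the trace is 3 for the plaquette and 3 for each rectangle term, which gives a quick sanity check of the path encoding.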

Here the variables $\mu$ and $\nu$ are the directions in which the links point on the
lattice. There are four directions for a four-dimensional hypercubic lattice. The link
product $U_{\rm rect}(x)$ denotes the rectangular $1 \times 2$ plaquettes and $u_0$ is the tadpole improvement
factor, commonly known as the mean-field improvement factor, which largely corrects for
the quantum renormalization of the links. In our numerical studies we have typically employed
the plaquette definition of the mean-field improvement factor,

$$ u_0 = \left( \frac{1}{3}\, \mathcal{R}e\, {\rm Tr}\, \langle U_{\rm sq} \rangle \right)^{1/4} . \tag{5} $$
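As a small numerical illustration of the plaquette definition of the mean-field factor, the sketch below takes a list of already-measured plaquette values (each one a third of the real part of the trace of a $1 \times 1$ loop) and returns the fourth root of their average. The function name and the input format are hypothetical and chosen only for this example.

```python
def mean_field_u0(plaquette_values):
    """Fourth root of the lattice-averaged plaquette.

    `plaquette_values` holds (1/3) Re Tr U_sq for every plaquette measured
    on the lattice; the average is assumed to be positive.
    """
    values = list(plaquette_values)
    mean_plaquette = sum(values) / len(values)
    return mean_plaquette ** 0.25
```

For a cold start every plaquette value is 1, so the factor is exactly 1 and the improved action reduces to its tree-level form.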

For the improved action in Eq. (2) the residual perturbative corrections after mean-field
improvement are estimated to be of the order of two to three percent [18]. Of course, both
Eqs. (1) and (2) reproduce the continuum gluon action as $a \to 0$, where $\beta = 6/g^2$ and $g$
is the QCD coupling constant at the scale $a$. It is useful to note that our $\beta = 6/g^2$ differs
from the convention of Refs. [18]-[20]. A multiplication of our $\beta$ in Eq. (2) by a factor of 5/3
reproduces their definition.

Let us comment on the lattice configurations that we have generated with the general
algorithm described here and which we have used extensively in Refs. [13]-[17]. The gauge
configurations are generated using the Cabibbo-Marinari pseudo-heat-bath algorithm
with three diagonal $SU_c(2)$ subgroups. All calculations are performed using a highly parallel
code written in CM-Fortran and run on a Thinking Machines Corporation (TMC) CM-5
with appropriate link partitioning. For the standard Wilson action we partition the link
variables in a checkerboard fashion. While all calculations to date have been for SU(3), there
is no restriction in the algorithm on the number of colours for the gauge group, and we
could just as easily have treated the case of SU(N).

The mean-field improvement factor was updated on a regular basis during the simulation.
Once the lattice is thermalized from a cold start (after at least five thousand sweeps), the
$u_0$ factor is held fixed during the generation of the ensemble of gauge field configurations.
The ensemble is built up by sampling the fields with a separation of at least 500 Monte
Carlo sweeps over the entire lattice to ensure that they are sufficiently decorrelated. For the
case of the standard Wilson action, configurations have been generated on a $16^3 \times 32$ lattice
at $\beta = 5.70$ and a $24^3 \times 36$ lattice at $\beta = 6.00$. For the improved action of Eq. (2) we have
generated $8^3 \times 16$, $12^3 \times 24$, $16^3 \times 32$, and $24^3 \times 36$ lattices with $\beta$ values of 3.57, 4.10, 4.38,
and 5.00, respectively.



B. Masking and Parallel Computing 

When performing a Monte Carlo sweep of the entire lattice each lattice link must be 
updated individually using the particular gluon action of interest (e.g., $S_G$). The action is
used in the Monte Carlo accept/reject step for that link in order that detailed balance is 
ensured at each link update and hence that it is ensured throughout the entire lattice sweep. 
It is the combination of randomness in the link updates, the maintenance of detailed balance, 
and decorrelation (ensured by large sweep numbers between the taking of samples) that 
ensures that the desired ensemble of gauge field configurations is produced with the Boltzmann
distribution $\exp(-S_G)$.

In the most naive procedure we move through each link on the lattice consecutively 
updating them one at a time until we have completed a "sweep" through the entire lattice. 
We then repeat these lattice sweeps as often as required. This simple procedure is highly 
inefficient on a parallel computing architecture, where we can be updating many links at 
the same time. However, there is a fundamental limitation to this parallelism, i.e., we will 
violate detailed balance and corrupt our data if we try to simultaneously update a link while 
information about that link is being used in the update of another link. It is crucial that we 
identify which links can be updated simultaneously and this is determined by the degree of 
nonlocality in the action. For example, for an action which contains only nearest-neighbour
interactions of the links, such as the Wilson action, we can use an efficient "checkerboard"
algorithm, which will be described below. In general, the more nonlocal the lattice
gluon action, the fewer the links that can be simultaneously updated. We see that the
improvement program is therefore more expensive to implement, but the benefit of improved 
actions far outweighs this drawback. 

In order to facilitate our discussions we will refer to the concept of "masking", where the
lattice links not eliminated by the mask are the ones that can be simultaneously updated in a
parallel computing environment. The number of independent masks needed for a particular 
action determines an upper limit to the parallelism that can be used in a single lattice sweep. 
As we will see, the best that can be done is to have two masks per link direction and this is 
for the case of nearest neighbor interactions only. 

We will simplify the presentation in the usual way by rescaling all dimensionful quantities 
by the lattice spacing a. This is equivalent to setting a = 1. 



III. MASKING FOR THE STANDARD WILSON ACTION. 

In the standard Wilson action, where only neighbouring links are connected by the action, 
we need only two masks for each of the four link directions. There are two different ways of 
implementing this masking as we will now discuss. 



A. Checker Board Masking. 

The standard Wilson action only involves $1 \times 1$ Wilson loops (depicted in Fig. 1) and is
the most fundamental lattice gluonic action. Whenever a given link is being updated, we
must not attempt to update any of the other links within any of the $1 \times 1$ plaquettes which
contain the given link. Consider the link from the lattice site $x$ to $x + \hat\mu$, where $\hat\mu$ is one of
the four Cartesian unit vectors $\hat{x}$, $\hat{y}$, $\hat{z}$, or $\hat{t}$. We see then that the plaquette in Fig. 1 forms
a "staple" consisting of three links in the $\mu$-$\nu$ plane which is attached to the link of interest
$U_\mu(x)$. [Note that we are sometimes using $x$ as a shorthand notation for the space-time
lattice point $x = (x, y, z, t)$ as well as for the $x$-coordinate on the $x$ axis. The meaning
should be clear from the context.] We could equally well consider the plaquette and staple
below the link $U_\mu(x)$ in the figure, which also lies in the $\mu$-$\nu$ plane. In addition, for a given
Cartesian direction $\hat\mu$, there are three possible choices for $\hat\nu$, i.e., there are three orthogonal
planes which contain the link and two staples per plane.



FIG. 2. Checkerboard masking as seen in an $x$-$y$ plane of the lattice when using the standard
Wilson action. The highlighted links with arrows can be updated simultaneously.

Let us consider, for example, all of the links in the $x$-$y$ plane which are oriented in the $x$
direction. We can see from Fig. 2 that we can choose a "checkerboard" of such links that can
be updated at the same time without interfering with each other. These links are indicated
in the figure as highlighted links with arrows. It is easy to see that none of the links to be
updated lie in any of the staples for the other links to be updated and that exactly half of
the $x$-oriented links in this plane can be simultaneously updated at one time.

FIG. 3. Rotating the $1 \times 1$ plaquette sitting in the $x$-$y$ plane about the $x$-axis into the $x$-$z$ and
$x$-$t$ planes.

We have identified one of the lattice sites in Fig. 2 as the site $x$. If the link variable
$U_x(x)$ is to be updated then, from Fig. 2, we observe that the link variables in the $x$ direction
that can be simultaneously updated are $U_x(x + 2\hat{x})$, $U_x(x + 4\hat{x})$ and so on. So every second
link along the $x$ direction can be updated at the same time. Now let us consider stepping in
the $y$ direction. We again see that every second link in that direction can be simultaneously
updated. By symmetry the same must also be true for the $z$ and $t$ directions as depicted in
Fig. 3, where we have used a broken dash-dot line to indicate the fourth dimension
(i.e., for the links that lie in the $x$-$t$ plane). We see that for the link pointing in the $x$
direction, the plaquettes (and staples) in the $x$-$y$, $x$-$z$, and $x$-$t$ planes are all related by simple
rotations about the link. Thus we have now built up a four-dimensional mask
for determining which links pointing in the $x$ direction can be simultaneously updated.

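Let us introduce some convenient shorthand notation. If for a given link pointing in
the direction $\hat\mu$ we must take $n$ steps in the direction $\hat\nu$ before reaching the next updatable
link pointing in the direction $\hat\mu$, we will use the notation $\hat\mu : \hat\nu \sim n\hat\nu$. For our checkerboard
masking we see that for a link pointing in the direction $\hat{x}$ we have to take two steps in each
of the Cartesian directions before reaching the next updatable link. Hence we write

$$ \hat{x} : \quad \hat{x} \sim 2\hat{x} , \quad \hat{y} \sim 2\hat{y} , \quad \hat{z} \sim 2\hat{z} \quad \text{and} \quad \hat{t} \sim 2\hat{t} . \tag{6} $$

We immediately see that this is also true for links oriented in the $\hat{y}$, $\hat{z}$, and $\hat{t}$ directions so
that

$$ \hat{y} : \quad \hat{x} \sim 2\hat{x} , \quad \hat{y} \sim 2\hat{y} , \quad \hat{z} \sim 2\hat{z} \quad \text{and} \quad \hat{t} \sim 2\hat{t} , \tag{7} $$
$$ \hat{z} : \quad \hat{x} \sim 2\hat{x} , \quad \hat{y} \sim 2\hat{y} , \quad \hat{z} \sim 2\hat{z} \quad \text{and} \quad \hat{t} \sim 2\hat{t} , \tag{8} $$
$$ \hat{t} : \quad \hat{x} \sim 2\hat{x} , \quad \hat{y} \sim 2\hat{y} , \quad \hat{z} \sim 2\hat{z} \quad \text{and} \quad \hat{t} \sim 2\hat{t} . \tag{9} $$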

Finally, note that when we wish to update all of the links pointing in any one of the
four Cartesian directions, say $\hat\mu$, we need only two four-dimensional masks. This is because
exactly half of the $\hat\mu$-oriented links across the entire lattice are considered in each
four-dimensional mask. To appreciate this we simply note that for any one of the Cartesian
directions one mask can be turned into the checkerboard complement mask for that direction
by shifting the mask by one step in any Cartesian direction (see Fig. 2). So to update all of
the links on the lattice we need a total of 8 four-dimensional masks, i.e., 2 masks for each of
the four Cartesian directions. In other words, no matter how many nodes we have available
on our parallel computing architecture, a full lattice updating sweep will require 8 serial
masked sweeps to complete with a nearest-neighbour action (such as the Wilson action) and
checkerboard masking. This is the conventional procedure for the standard Wilson action
in lattice QCD studies. In closing this section on the standard Wilson action, let us note
that Sec. III B describes an alternative and equally good "linear" masking for this case.
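The checkerboard construction can be checked with a short sketch. It assumes that a mask is represented as the set of sites whose link in the chosen direction is updated, and that the checkerboard is selected by the parity of the coordinate sum; both the representation and the function names are illustrative choices, not taken from the production code.

```python
import itertools

N_DIM = 4  # directions 0=x, 1=y, 2=z, 3=t

def sites(dims):
    return itertools.product(*(range(length) for length in dims))

def shifted(site, mu, n, dims):
    """Site reached from `site` after n steps along direction mu (periodic)."""
    moved = list(site)
    moved[mu] = (moved[mu] + n) % dims[mu]
    return tuple(moved)

def checkerboard_mask(dims, parity):
    """Sites with (x + y + z + t) % 2 == parity; their mu-links are updated together."""
    return {s for s in sites(dims) if sum(s) % 2 == parity}

def has_collision(mask, mu, dims):
    """True if two masked mu-links share a 1x1 plaquette.

    For a nearest-neighbour action the only other mu-oriented links in the
    plaquettes containing U_mu(s) sit at s +/- nu for each nu != mu.
    """
    for s in mask:
        for nu in range(N_DIM):
            if nu == mu:
                continue
            for sign in (1, -1):
                if shifted(s, nu, sign, dims) in mask:
                    return True
    return False

def total_checkerboard_masks():
    """Two complementary masks for each of the four link orientations."""
    return 2 * N_DIM
```

On a $4^4$ lattice each mask holds exactly half of the sites, a one-step shift in any direction turns one mask into its complement, and neither mask contains a collision, in line with the count of 8 masked sweeps per full lattice sweep.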



B. Linear Masking. 



As an alternative approach to the checkerboard masking described in Sec. III A, one
could partition the links over the lattice in a linear fashion as shown in Fig. 4. If the link
variable of interest is $U_x(x)$ then the next link variable in the $x$ direction which can
be updated is the $U_x(x + \hat{x})$ link, then $U_x(x + 2\hat{x})$, and so on. We see that all the links
on the $x$ line can be updated at the same time, since none of these links are contained in
the $1 \times 1$ plaquettes for the other links in the line. Hence we have $\hat{x} : \hat{x} \sim 1\hat{x}$. Now looking
in the $y$ direction, we realize that we cannot touch the $U_x(x + \hat{y})$ link because it is part of
the Wilson loop containing the link variable $U_x(x)$, which is being updated simultaneously.
However, the links $U_x(x + 2\hat{y})$, $U_x(x + 4\hat{y})$, etc. can be updated. Consequently, we have
$\hat{x} : \hat{y} \sim 2\hat{y}$ and similarly for steps in the $z$ and $t$ directions. For a link variable pointing in
the $x$ direction we then have that

$$ \hat{x} : \quad \hat{x} \sim 1\hat{x} , \quad \hat{y} \sim 2\hat{y} , \quad \hat{z} \sim 2\hat{z} \quad \text{and} \quad \hat{t} \sim 2\hat{t} . \tag{10} $$

FIG. 4. Linear masking of the lattice when using the standard Wilson action. The highlighted
arrows represent the link variables that can be updated simultaneously.

When the links to be updated are pointing in the other three directions we have

$$ \hat{y} : \quad \hat{x} \sim 2\hat{x} , \quad \hat{y} \sim 1\hat{y} , \quad \hat{z} \sim 2\hat{z} \quad \text{and} \quad \hat{t} \sim 2\hat{t} , \tag{11} $$
$$ \hat{z} : \quad \hat{x} \sim 2\hat{x} , \quad \hat{y} \sim 2\hat{y} , \quad \hat{z} \sim 1\hat{z} \quad \text{and} \quad \hat{t} \sim 2\hat{t} , \tag{12} $$
$$ \hat{t} : \quad \hat{x} \sim 2\hat{x} , \quad \hat{y} \sim 2\hat{y} , \quad \hat{z} \sim 2\hat{z} \quad \text{and} \quad \hat{t} \sim 1\hat{t} , \tag{13} $$

for the $y$, $z$ and $t$ directions respectively.

Again, we see that there are two complementary linear masks for links pointing in any
given Cartesian direction $\hat\mu$. One mask can be obtained from the other by a shift of one step
in any of the three Cartesian directions orthogonal to $\hat\mu$, as can be appreciated from Fig. 4.
Thus this linear masking is just as efficient as the checkerboard masking of the previous
section, since there are 2 masks for each of the 4 Cartesian directions, giving a total of 8
masks.
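A linear mask can be sketched in the same spirit as a set of sites. The sketch assumes that the mask for links along direction `mu` is selected by the parity of the sum of the three coordinates transverse to `mu`; this is one concrete choice consistent with the step pattern described above, and the names are hypothetical.

```python
import itertools

N_DIM = 4  # directions 0=x, 1=y, 2=z, 3=t

def sites(dims):
    return itertools.product(*(range(length) for length in dims))

def shifted(site, mu, n, dims):
    """Site reached from `site` after n steps along direction mu (periodic)."""
    moved = list(site)
    moved[mu] = (moved[mu] + n) % dims[mu]
    return tuple(moved)

def linear_mask(dims, mu, parity):
    """Sites whose transverse coordinate sum has the given parity.

    Every site on a line along mu is kept, so whole lines are updated at once.
    """
    return {s for s in sites(dims) if (sum(s) - s[mu]) % 2 == parity}

def has_collision(mask, mu, dims):
    """True if two masked mu-links share a 1x1 plaquette (neighbours at s +/- nu)."""
    for s in mask:
        for nu in range(N_DIM):
            if nu == mu:
                continue
            for sign in (1, -1):
                if shifted(s, nu, sign, dims) in mask:
                    return True
    return False
```

The mask is unchanged by a step along `mu` and turns into its complement under a step in any transverse direction, which reproduces the two-masks-per-direction count.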



IV. MASKING AN IMPROVED ACTION. 

In this section, we describe the necessary masking procedure for a first-level improved
action involving $1 \times 1$ and $1 \times 2$ Wilson loops. In particular, we describe the masking
suitable for the improved gauge action of Eq. (2), which has been used
extensively by us [13]-[17]. Let us again begin by considering the link variable beginning at
some lattice site $x$ and pointing in the $x$ direction, i.e., $U_x(x)$. We now need to consider
both Fig. 3 for the elementary $1 \times 1$ square plaquette and Fig. 5 for the $1 \times 2$ rectangular
plaquette. In Fig. 5 we show all of the $1 \times 2$ rectangular plaquettes which contain the link
$U_x(x)$, which is shown as the highlighted horizontal link in the three parts of this figure.
Visualizing a four-dimensional object on a flat piece of paper can be, to a certain extent,
an artistic challenge and so we have again used a dash-dot line to indicate links lying in
the $x$-$t$ plane. There are three distinguishable ways to include this link in a $1 \times 2$ plaquette
(the three parts of the figure) and for each of these there are two (mirror-image) rectangles
per Cartesian plane and three Cartesian planes containing the link. All links in Figs. 3 and 5
with arrows (other than the link $U_x(x)$ itself) must be omitted from the mask when updating
this link with our improved action. We see that there are many excluded links.

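In Fig. 6 we show which links can be simultaneously updated with the link $U_x(x)$. We
can immediately write down by inspection from this figure that

$$ \hat{x} : \quad \hat{x} \sim 2\hat{x} , \quad \hat{y} \sim 3\hat{y} , \quad \hat{z} \sim 3\hat{z} \quad \text{and} \quad \hat{t} \sim 3\hat{t} . \tag{14} $$

This follows since the $z$ and $t$ cases are identical to the $y$ case for this $x$-oriented link. The
generalization to the other orientations of the links to be updated is straightforward by
symmetry:

$$ \hat{y} : \quad \hat{x} \sim 3\hat{x} , \quad \hat{y} \sim 2\hat{y} , \quad \hat{z} \sim 3\hat{z} \quad \text{and} \quad \hat{t} \sim 3\hat{t} , \tag{15} $$
$$ \hat{z} : \quad \hat{x} \sim 3\hat{x} , \quad \hat{y} \sim 3\hat{y} , \quad \hat{z} \sim 2\hat{z} \quad \text{and} \quad \hat{t} \sim 3\hat{t} , \tag{16} $$
$$ \hat{t} : \quad \hat{x} \sim 3\hat{x} , \quad \hat{y} \sim 3\hat{y} , \quad \hat{z} \sim 3\hat{z} \quad \text{and} \quad \hat{t} \sim 2\hat{t} . \tag{17} $$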



Let us return to the particular case of the masking for $x$-oriented links. From Eq. (14)
we see that there is symmetry between the $y$, $z$, and $t$ directions and so we will begin by
constructing suitable masks for any given equal-$x$ hyper-plane, i.e., for the three-dimensional
space spanned by the unit vectors $\hat{y}$, $\hat{z}$, and $\hat{t}$.

FIG. 5. The set of all possible $1 \times 2$ plaquettes containing the link $U_x(x)$. The dashed-dotted
line is to be understood as being in the $x$-$t$ plane.

FIG. 6. The highlighted links with arrows are the ones that can be simultaneously updated for
an action containing both $1 \times 1$ and $1 \times 2$ plaquettes.

Before attempting this, let us first consider Fig. 6 and extend this to three dimensions
by imagining that the $z$-axis is pointing directly out from the page. We shall temporarily
neglect the $t$ direction, which is equivalent to simply taking a slice of the four-dimensional
lattice with the same value of $t$ (i.e., an equal-$t$ hyper-plane). Now let us view this
three-dimensional lattice by looking along the $x$-axis at one particular equal-$x$ plane. We will then
be presented with end-views of updatable links in the $y$-$z$ plane. For every fixed value of
$z$ there are three different masks needed for $y$ and vice versa. Also, there is no restriction
on simultaneously updating diagonally shifted links, since we are only considering planar
actions at this point. It is not difficult to see that we can cover all of the nine lattice links
that need to be updated with three orthogonal masks as shown in Fig. 7. In this figure
$x$-oriented links which can be updated at the same time are indicated by a solid dot. Note
that each of these masks is related by a diagonal shift of the nine-point lattice "window".

FIG. 7. Schematic illustration of the lattice masking when using the $1 \times 2$ plaquette improved
action.

FIG. 8. Illustration of the cyclic plane rotation in the improved masking.

We can now also extend this thinking to include the $t$ direction, by stacking the three
two-dimensional $y$-$z$ masks on top of each other as shown in Fig. 8. We must stack the
planes so that when viewed along any of the three axes the solid dots in any one Cartesian
plane always have the appearance of one of the planes in Fig. 7. We see that this can be
achieved in three ways by the stacking in Fig. 8 and its two cyclic permutations. These three
three-dimensional masks when summed give the identity (i.e., the sum includes all points)
and are orthogonal to each other (i.e., the sum includes each point only once).
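One concrete realization of three such stacked masks can be written down with modular arithmetic. The sketch below assumes the dots sit where the sum of the three transverse coordinates has a fixed remainder modulo 3; this reproduces the properties described in the text, although the orientation of the diagonal in the actual figures may differ. All names are hypothetical.

```python
import itertools

PERIOD = 3  # transverse step for the 1x1 + 1x2 action

def cube_mask(k):
    """Dots of the k-th three-dimensional mask inside one 3x3x3 cube.

    The plane at height t is the diagonal pattern (y + z) % 3 == (k - t) % 3,
    i.e. the planes are cyclically shifted as one moves along t.
    """
    return {(y, z, t)
            for y, z, t in itertools.product(range(PERIOD), repeat=3)
            if (y + z + t) % PERIOD == k}

def one_dot_per_axis_line(mask):
    """True if every line parallel to a Cartesian axis holds exactly one dot."""
    for axis in range(3):
        for fixed in itertools.product(range(PERIOD), repeat=2):
            count = 0
            for c in range(PERIOD):
                point = list(fixed)
                point.insert(axis, c)
                count += tuple(point) in mask
            if count != 1:
                return False
    return True
```

Each mask holds 9 of the 27 points, the three masks are mutually disjoint, and together they cover the cube exactly once.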

We can now give a simple geometrical picture of what we are doing, which will simplify
the generalization that we give in the next section. For $x$-oriented links, the $y$, $z$,
and $t$ directions are all symmetrical and each direction requires a step of 3 to reach the next
updatable link. Hence, we need to construct a complete set of orthogonal masks in three
dimensions for a $3 \times 3 \times 3$ cube, where no two points in a mask lie on the same Cartesian
axis (i.e., only diagonally related points). This is simple to do. Let us consider the bottom
plane (i.e., plane 1) of Fig. 8 and connect the three solid dots by a diagonal line. We see that
plane 2 is obtained from plane 1 by a diagonal shift of this line by one diagonal half-step,
and similarly for plane 3. In visualizing this it may help to imagine surrounding the cube
by many identical copies of itself and moving the diagonal line through diagonal half-steps
across all of these cubes simultaneously. All three three-dimensional masks are obtained in
the same way but start with plane 1, plane 2, and plane 3, respectively.

So for the $x$-oriented links we need 3 masks for each equal-$x$ hyper-plane (i.e., a
three-volume here) and we have two independent equal-$x$ hyper-planes, giving a total of 6 masks
for each Cartesian direction of the link orientation. Since there are 4 orientations,
there is a total of 24 masks needed for an action containing both $1 \times 1$ and $1 \times 2$ plaquettes.
Thus a single lattice sweep must take at least 24 sequential serial calculations even on the
most parallel computing architecture.
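The count of 24 masks, and the absence of link collisions, can be verified by brute force on a small lattice. The sketch below assumes the mask for direction `mu` is labelled by the parity of the coordinate along `mu` and by the transverse coordinate sum modulo 3, and it enumerates by hand the offsets to the other `mu`-oriented links that share a $1 \times 1$ or $1 \times 2$ loop with a given link. The names and the set-based representation are illustrative assumptions.

```python
import itertools

N_DIM = 4  # directions 0=x, 1=y, 2=z, 3=t

def sites(dims):
    return itertools.product(*(range(length) for length in dims))

def improved_mask(dims, mu, j, k):
    """Sites with s[mu] % 2 == j and transverse coordinate sum % 3 == k."""
    return {s for s in sites(dims)
            if s[mu] % 2 == j and (sum(s) - s[mu]) % 3 == k}

def dependency_offsets(mu):
    """Offsets to other mu-links lying in a 1x1 or 1x2 planar loop with U_mu(s)."""
    in_plane = [(1, 0), (-1, 0),                    # same side of a 2-long rectangle
                (0, 1), (0, -1),                    # opposite side, one step away
                (1, 1), (1, -1), (-1, 1), (-1, -1), # opposite side of a 2-long rectangle
                (0, 2), (0, -2)]                    # far side of a rectangle standing on the link
    offsets = set()
    for nu in range(N_DIM):
        if nu == mu:
            continue
        for d_mu, d_nu in in_plane:
            off = [0] * N_DIM
            off[mu] = d_mu
            off[nu] = d_nu
            offsets.add(tuple(off))
    return offsets

def has_collision(mask, offsets, dims):
    """True if any masked link depends on another masked link (periodic lattice)."""
    for s in mask:
        for off in offsets:
            other = tuple((c + d) % length for c, d, length in zip(s, off, dims))
            if other in mask:
                return True
    return False

def all_masks(dims):
    return {(mu, j, k): improved_mask(dims, mu, j, k)
            for mu in range(N_DIM) for j in range(2) for k in range(3)}
```

On a $6^4$ lattice the six masks for each orientation cover every site exactly once and none of them contains a collision, so a full sweep consists of 24 masked passes.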

The masking procedure outlined here for this action can only be implemented when
the number of lattice points in each dimension is a multiple of three. Inspection of the
masks reveals that a periodicity of three is required to maintain separation of links at the
boundary. Since simulations are usually carried out on lattices with even-numbered sides,
this restricts the length of the lattice sides to multiples of six. Fortunately, multiples of four
are easily obtained as described in the next section. Moreover, Sec. VI reports a
high-performance mask for this action with a periodicity of four.
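The boundary restriction can be seen on a single periodic line. The sketch below marks every third site as updated and checks whether any two marked sites end up closer than three steps once the line is closed on itself; the function names and the default period are illustrative assumptions.

```python
def wraps_cleanly(length, period=3):
    """True if sites y % period == 0 stay `period` apart on a periodic line."""
    updated = {y for y in range(length) if y % period == 0}
    for y in updated:
        for d in range(1, period):
            if (y + d) % length in updated:
                return False
    return True

def allowed_even_lengths(max_length, period=3):
    """Even side lengths up to max_length for which the mask survives the boundary."""
    return [length for length in range(2, max_length + 1, 2)
            if wraps_cleanly(length, period)]
```

A side of length 8, for instance, places the marked sites 6 and 0 only two steps apart across the boundary, whereas lengths 6, 12, 18 and 24 keep the spacing intact.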

It is interesting to note that when implementing this masking procedure on the CM-5 we 
achieved optimum performance by calculating the updates for all links on the lattice and by 
then only implementing those updates that were appropriate for the particular mask being 
used at the time. In other words for the lattices that we have studied so far on the CM-5 
it was more efficient to calculate link updates that were never used, than it was to split the 
masked links over the various processor nodes and update only these masked links. This 
was due to the fact that there was a large overhead of communication time in assigning the 
masked links across the processors. The point of this observation is that the optimal use of 
the masks will in general depend on the details of the parallel computing architecture being 
used. 
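The strategy of computing every update and then keeping only the masked ones can be sketched in a few lines. This is a schematic serial illustration with hypothetical names, using flat Python lists in place of the distributed link arrays; it only shows the selection step, not the heat-bath update itself.

```python
def masked_update(links, proposals, mask):
    """Keep the proposed value where the mask is set and the old value elsewhere.

    `proposals` is assumed to have been computed for every link, including
    the ones that the current mask discards.
    """
    return [new if keep else old
            for old, new, keep in zip(links, proposals, mask)]
```

The wasted work on discarded proposals is traded against the communication cost of redistributing only the masked links across processors.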



V. MASKING THE LATTICE WHEN USING A GENERALIZED IMPROVED ACTION.

We can now generalize the algorithm presented in Sec. IV to arbitrarily improved planar
actions. Let us begin as before by considering the update of links oriented in the $x$ direction.
Let us assume that we have an action with $n \times m$ Wilson loops, where $n$ refers to the $x$ direction
and $m$ refers to the $y$, $z$, and $t$ directions. We will eventually argue that only the $n_{\rm max} \times n_{\rm max}$
case, where $n_{\rm max}$ is the greater of $n$ and $m$, is necessary in the general case. As shown in
Fig. 9, the nearest simultaneously updatable links are separated by $n$ steps in the $x$ direction
and $(m + 1)$ steps in the other three Cartesian directions.

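Hence we see that we can write in our notation for the four Cartesian orientations of the
links that

$$ \hat{x} : \quad \hat{x} \sim n\hat{x} , \quad \hat{y} \sim (m+1)\hat{y} , \quad \hat{z} \sim (m+1)\hat{z} \quad \text{and} \quad \hat{t} \sim (m+1)\hat{t} , \tag{18} $$
$$ \hat{y} : \quad \hat{x} \sim (m+1)\hat{x} , \quad \hat{y} \sim n\hat{y} , \quad \hat{z} \sim (m+1)\hat{z} \quad \text{and} \quad \hat{t} \sim (m+1)\hat{t} , \tag{19} $$
$$ \hat{z} : \quad \hat{x} \sim (m+1)\hat{x} , \quad \hat{y} \sim (m+1)\hat{y} , \quad \hat{z} \sim n\hat{z} \quad \text{and} \quad \hat{t} \sim (m+1)\hat{t} , \tag{20} $$
$$ \hat{t} : \quad \hat{x} \sim (m+1)\hat{x} , \quad \hat{y} \sim (m+1)\hat{y} , \quad \hat{z} \sim (m+1)\hat{z} \quad \text{and} \quad \hat{t} \sim n\hat{t} . \tag{21} $$

FIG. 9. The highlighted links with arrows are the ones that can be simultaneously updated
for an action containing up to $n \times m$ plaquettes, where here $n$ refers to the $x$ direction and $m$
applies to the other three Cartesian directions.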



We can now follow the arguments of the previous section. Let us consider a fixed-$x$
hyper-plane (i.e., three-volume). In place of a $3 \times 3 \times 3$ three-volume we will now need an
$(m + 1) \times (m + 1) \times (m + 1)$ three-volume. Furthermore, we will need a complete set
of orthogonal and diagonal masks for this. Let us again look along the $x$ direction at a
fixed-$t$ plane for now, i.e., we are looking at a $y$-$z$ plane. Let us refer to
the $(m + 1) \times (m + 1)$ two-dimensional plane with the updatable links (solid dots) along
the diagonal as plane 1. Then we can generate the other $m$ two-dimensional planes by
diagonal half-shifts as before, as depicted in Figs. 11 and 12. We can then sequentially stack
these planes in the $t$ direction as before to form the first of the three-dimensional masks.
The other $m$ three-dimensional masks are then generated from this first mask by the cyclic
permutations of the $m+1$ planes as in Sec. IV. Hence we have generated the desired complete
set of $(m + 1)$ orthogonal three-dimensional diagonal masks.



FIG. 10. Plane 1 with the $(m + 1)$ updatable sites on the main diagonal of the $y$-$z$ plane.



FIG. 11. Plane 2.

FIG. 12. Plane 3.

So for each fixed-x hyper-plane (i.e., three-volume) we need (m + 1) masks. We will need
such a set of masks for the n values of x. The general result is that for updating the links
oriented in the x direction we need a total of n x (m + 1) masks, and we have seen that
the construction of these masks is straightforward. The construction of the masks for the
other Cartesian orientations of the links proceeds identically. The total number of masks is
therefore $n_{\rm mask} = 4 \times n \times (m + 1)$. The periodicity of the mask is governed by
the last factor, (m + 1), and the length of each lattice dimension must be a multiple of this
number. The reason is that otherwise the imposition of the necessary periodic boundary
conditions would cause link collisions, where a link being updated uses one or more other
links which are simultaneously being updated.
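The role of the periodic boundary conditions can be checked along a single axis. The sketch
below assumes, for illustration only, that the masked sites along one periodic axis sit at
multiples of (m + 1) and that two masked sites collide when their periodic separation is m or
less. The function name `wraps_safely` is hypothetical.

```python
def wraps_safely(length, m):
    """Check a period-(m + 1) mask on a periodic axis of the given length.

    Returns False if wrapping around the boundary brings two masked sites
    within m steps of each other.
    """
    period = m + 1
    selected = [site for site in range(length) if site % period == 0]
    for a in selected:
        for b in selected:
            if a == b:
                continue
            distance = abs(a - b)
            # Shortest separation on a ring of circumference `length`.
            if min(distance, length - distance) <= m:
                return False
    return True


for length in (6, 8, 9):
    print(length, wraps_safely(length, m=2))
```

For m = 2, lengths 6 and 9 are multiples of 3 and pass the check, while length 8 fails because
the masked sites 0 and 6 are only two steps apart across the boundary.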

Any improved lattice action of physical interest must be both Z4-symmetric (i.e., symmetric
under the arbitrary interchange of the four Cartesian directions) and translationally
invariant. Thus for such actions every link will find itself occurring in every possible position
for every plaquette in the improved action. We then see, as we did in Sec. IV and Fig. 9,
that the number of steps needed in each direction is determined by the longest plaquette side
appearing in the action. Let us denote the longest plaquette side appearing in the action as
$n_{\max}$. Then we see that the number of steps needed in the various Cartesian directions is
given by

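$x:\quad x \sim n_{\max}\hat{x},\quad y \sim (n_{\max}+1)\hat{y},\quad z \sim (n_{\max}+1)\hat{z} \quad\text{and}\quad t \sim (n_{\max}+1)\hat{t}$, (22)

$y:\quad x \sim (n_{\max}+1)\hat{x},\quad y \sim n_{\max}\hat{y},\quad z \sim (n_{\max}+1)\hat{z} \quad\text{and}\quad t \sim (n_{\max}+1)\hat{t}$, (23)

$z:\quad x \sim (n_{\max}+1)\hat{x},\quad y \sim (n_{\max}+1)\hat{y},\quad z \sim n_{\max}\hat{z} \quad\text{and}\quad t \sim (n_{\max}+1)\hat{t}$, (24)

$t:\quad x \sim (n_{\max}+1)\hat{x},\quad y \sim (n_{\max}+1)\hat{y},\quad z \sim (n_{\max}+1)\hat{z} \quad\text{and}\quad t \sim n_{\max}\hat{t}$. (25)
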
Hence the number of masks in general for an improved action will then be given by

$n_{\rm mask} = 4 \times n_{\max} \times (n_{\max} + 1)$ (26)

and the lattice will need the length in each dimension to be an integral multiple of $(n_{\max} + 1)$.
It is useful to note that the linear masking for the standard Wilson action is the one
that is extended initially in Sec. IV and is subsequently generalized in this section. For
the standard Wilson action (i.e., 1x1 plaquettes only) we see that $n_{\max} = 1$ and hence
$n_{\rm mask} = 4 \times 1 \times 2 = 8$, as we found for the linear (and checkerboard) mask. For
the improved action that we have studied (i.e., 1x1 and 1x2 plaquettes) we have $n_{\max} = 2$
and hence $n_{\rm mask} = 4 \times 2 \times 3 = 24$, or 6 masks per link direction, as found in
Sec. IV. However, this way of proceeding for the plaquette plus rectangle improved action would
require each lattice dimension to be a multiple of $(n_{\max} + 1) = 3$; since we also typically
want our lattices to have even lengths, each side of the lattice would then need to be a multiple
of 6 in length. Since the result in Eq. (26) is a lower bound, we can of course always choose to
enlarge the period of our masking by choosing $n_{\max} + 2$ for the last factor in Eq. (26) rather
than $n_{\max} + 1$. This will still ensure that no link collisions occur. For example, for the
plaquette plus rectangle improved action we can use $(n_{\max} + 2) = 4$ instead of
$(n_{\max} + 1) = 3$ in Eq. (26), so that any lattice lengths which are multiples of 4 become
available at the cost of requiring 32 masks rather than 24. Fortunately, for this case a more
efficient mask can be realized and will be presented in the next section.
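The mask counts and lattice-length constraints quoted above follow directly from Eq. (26) and
can be reproduced with a short Python sketch. The function names and the `extra_period`
parameter (which enlarges the last factor of Eq. (26) beyond its minimum) are illustrative
assumptions, not part of the original formulation.

```python
from math import gcd


def mask_count(n_max, extra_period=0):
    """Total number of masks, Eq. (26), with an optionally enlarged period."""
    return 4 * n_max * (n_max + 1 + extra_period)


def required_length_multiple(n_max, extra_period=0, even=True):
    """Smallest number that every lattice length must be a multiple of."""
    period = n_max + 1 + extra_period
    if not even:
        return period
    # Least common multiple of the mask period and 2 (even lattice lengths).
    return period * 2 // gcd(period, 2)


print(mask_count(1))                    # Wilson action: 8
print(mask_count(2))                    # plaquette plus rectangle: 24
print(mask_count(2, extra_period=1))    # enlarged period: 32
print(required_length_multiple(2))      # 6
print(required_length_multiple(2, 1))   # 4
```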



VI. NON-PLANAR CONSIDERATIONS 

We have presented a method for identifying links which may be simultaneously updated 
during Monte Carlo updates or cooling sweeps. The generality of the algorithm allows one
to parallelize link updates for planar actions of any degree of non-locality. In this section
we extend this analysis to a few special cases of actions in which out-of-plane considerations 
are necessary. Both cases are centred around the plaquette plus rectangle action of Eq. (H) 
in which 1x1 and 1x2 Wilson loops are considered in the action. Such actions dominate 
current improved gauge action analyses. 

In Sec. IV we illustrated how such an action can be masked through the consideration of
an elementary 3x3x3 cube in which one-third of the links may be simultaneously updated. 
However, only every second link in the direction of the links is updated simultaneously as 
illustrated in Fig. ^. Hence six masks per link direction are required. 

Here we consider an alternative masking specialized to the 1x1 and 1x2 Wilson
loop actions. Fig. 13 illustrates the manner in which these Wilson loops may be nested,
such that one need not restrict the mask to every second link in the direction of the links
being updated. This technique will reduce the number of masks by a factor of two, at the
expense of considering an elementary 4x4x4 cube in which one-quarter of the links may be
simultaneously updated. Fig. 14 displays the four planes to be cycled through, in which the
links to be updated simultaneously are indicated by the solid dots. Hence only four masks
per link direction are required. Moreover, the lattice dimensions (usually even numbers) can
now be multiples of four as opposed to six.

FIG. 13. Two elementary cells for an action involving 1x1 and 1x2 Wilson loops are nested
together such that one need not restrict the mask to every second link in the direction of the
links being updated. The links with the positions labeled are the ones that can be simultaneously
updated. The out-of-plane plaquette-plus-rectangle illustrates additional links that cannot be
simultaneously updated.

The out-of-plane considerations required for the nested action are also indicated in
Fig. 13. It becomes apparent that not only must the three links at (x, y + 1), (x, y + 2),
and (x, y + 3) be avoided, but so must the links two steps in a direction orthogonal to the
link direction and one step in a third direction (similar to the moves of a knight on a chess
board).

Inspection of the four planes to be cycled through in the elementary 4x4x4 cube
displayed in Fig. 14 indicates that such knight moves are already avoided in this mask.
However, it also becomes clear that the ordering of the planes is crucial. For example,
interchanging the positions of planes 2 and 3 would cause "link collisions" within the nested
mask.

Finally we consider non-planar actions in which one step out of the plane of the 1x1 and
1x2 Wilson loops is required. Such non-planar paths are introduced to eliminate small but
finite $O(g^2 a^2)$ errors, where g is the gauge coupling constant. The six-link paths commonly
referred to as the "chair" and "parallelogram" introduce a link parallel to that being
updated which is one step orthogonal to the link direction and one step in a third direction.

Inspection of Fig. 14 indicates that such 1 by 1 moves eliminate fully two of the four
planes and half of the parallel sites on each surviving plane. An example of four of the sites
which may still be updated in parallel is indicated by the circled sites in Fig. 14. As a
result there are now 16 masks required per link direction instead of 4. A total of 64
masks is now required for this action, which is still regarded as rather local.
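The counting behind these numbers can be spelled out as a short worked calculation. It assumes,
as a reading of the statements above, that the nested planar mask selects 16 of the 64 sites of
the elementary 4x4x4 cube, spread evenly over the four planes.

```latex
\begin{align*}
\text{sites in the elementary } 4 \times 4 \times 4 \text{ cube} &= 64, \\
\text{sites per nested planar mask} &= 64 / 4 = 16, \\
\text{sites surviving the non-planar paths} &= 16 \times \tfrac{2}{4} \times \tfrac{1}{2} = 4, \\
\text{masks per link direction} &= 64 / 4 = 16, \\
\text{total number of masks} &= 4 \times 16 = 64.
\end{align*}
```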

The introduction of even the most local non-planar paths can have a serious detrimental
effect on the level of parallelism that is possible. It is easy to see that one can rapidly
eliminate all sites in an elementary n x n x n cube with non-planar loops, leading to $n^3$
masks per link direction.

FIG. 14. The four planes to be cycled through in the elementary 4x4x4 cube. One-quarter
of the links may be updated simultaneously and are indicated by the solid dot. The circled sites
are an example of the sites surviving when the out-of-plane "chair" or "parallelogram" link paths
are included in the action.



VII. SUMMARY AND CONCLUSION 

We have briefly described the concept of improved actions and have explained the implications
of the non-locality arising from the improvement program for the implementation of
these actions on parallel computing architectures. We have characterized these implications
in terms of the number of masks, which in turn determines the minimum number of serial
calculations needed to perform a Monte Carlo updating sweep over all of the gluon links
on the lattice. We have systematically built up a completely general algorithm using masks
that allows one to put any planar improved lattice action on a parallel machine in an efficient
way. The generalized masking construction is given in Sec. V.

Non-planar considerations encountered in nesting specific planar actions and actions
involving non-planar loops have also been addressed. We hope that the methodology presented
will allow one to find an efficient parallel mask for any desired action. We are currently
testing our algorithms on some highly improved actions and will be reporting the results of these
studies [11] in the near future.